Introduction

In this EDA, I explore a dataset of airline on-time performance to try and find insights to flight delay and cancellations. The dataset used is a very large dataset that consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. There are over 120 million observations (flights) in this dataset for flights. The data was compressed into individual CSV files for each year.

I chose to explore this particular dataset because it would allow me learn new skills and optimization techniques for handling large datasets.

Due to the size of this dataset, it would very difficult to load the data into a Pandas dataframe in memory without reducing it to a very small subset of the data, so I decided to employ the use of R markdown (instead of Jupyter notebook) so that I can use R packages, along with SQL queries, to wrangle the data into a more summarized format that a Pandas dataframe can handle.

Preliminary Wrangling

Importing R Libraries

library(tidyverse)
library(skimr)
library(dplyr)
library(here) # To locate files based on current working directory
library(janitor) # Tools for for examining and cleaning dirty data.
library(reticulate) # For reading R objects in Python
library(data.table) # For reading large datasets efficiently
library(inborutils) # For reading CSV files and converting to SQL
library(DBI) # Interface to connect with SQL databases
library(RSQLite) # For connecting with SQL databases

Importing Python Packages

import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pathlib
import os

Loading In The Data

This dataset contains 21 large CSV files of flight data for each year from 1987 to 2008, as well as some other CSV files for which contain extra information. I will not directly read any of the large CSV files because it would take too much memory. I will read in the other CSV files, then use R libraries and SQL queries to read in smaller samples of the data to explore.

path_column_data = "airline/airline_dataset_column_info.csv"
path_airport_data = "airline/other_data/airports.csv"
path_carrier_data = "airline/other_data/carriers.csv"
path_plane_data = "airline/other_data/plane-data.csv"
path_main = "airline/main_data" # Path to the main (yearly) CSV files

def absolute_file_paths(directory):
    data = {}
    files = os.listdir(directory)
    paths = [f"{path_main}/{file}" for file in files]
    
    for idx in range(len(files)):
        name = files[idx].split('.')[0]
        data[name] = paths[idx]
    return data

main_files = absolute_file_paths(path_main)
# Data on column descriptions for the main files
column_data = pd.read_csv(path_column_data)

# Data on different airports
airport_data = pd.read_csv(path_airport_data)

# Airline companies
carrier_data = pd.read_csv(path_carrier_data)

# Plane data, specifications and other info
plane_data = pd.read_csv(path_plane_data)
# Info on column descriptions for the main files
column_data <- read.csv(py$path_column_data)

# Data on different airports
airport_data <- read.csv(py$path_airport_data)

# Information on airline companies
carrier_data <- read.csv(py$path_carrier_data)

# Plane data, specifications and other info
plane_data <- read.csv(py$path_plane_data)

For the main data, I have written a script that reads in the data in smaller chunks and stores them in a database file (sqlite). Each year’s data is stored in its own table. I also store the other data in their own tables so that later, when needed, I can reference them using SQL joins. The process takes a while to run because of the large dataset (over 30 minutes on my PC).

Also, the original files are named by the year they represent. It is not be good practice to name a database table starting with a number, so the script adds a prefix to each name.

For now, I have added a condition so the code will only be executed if the sqlite file is not detected in the project root directory.

path_main <- "airline/main_data"
db_file <- "airline_data.sqlite"

save_in_sql <- function() {
  main_files <- list.files(path = path_main, full.names = TRUE)
  
  if (!file.exists(db_file)) {
    # Creating the airport data table
    inborutils::csv_to_sqlite(
                csv_file = py$path_airport_data,
                table_name = "airports",
                sqlite_file = db_file,
                show_progress_bar = FALSE)
    
    # Creating the carrier data table
    inborutils::csv_to_sqlite(
                csv_file = py$path_carrier_data,
                table_name = "carriers",
                sqlite_file = db_file,
                show_progress_bar = FALSE)
    
    # Creating the plane data table
    inborutils::csv_to_sqlite(
                csv_file = py$path_plane_data,
                table_name = "planes",
                sqlite_file = db_file,
                show_progress_bar = FALSE)
    
    # Creating the tables for each of the years' data
    for (csv in main_files) {
      csv_name <- strsplit(csv, "/|[.]")[[1]] # Splitting the csv name by "/" or "."
      csv_name <- csv_name[length(csv_name)-1] # Getting the second last element of the list
      table_name <- paste("table", csv_name, sep="_")
    
      print("Updating table: %s", table_name)
      inborutils::csv_to_sqlite(
                  csv_file = csv,
                  sqlite_file = db_file, 
                  table_name = table_name, 
                  pre_process_size = 1000,
                  chunk_size = 50000, 
                  show_progress_bar = TRUE)
    }
    
  }
}

save_in_sql()

Now I inspect the database file to be sure that all tables have been added and updated properly

airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db

db_tables <- dbListTables(airline_db) # List out the tables in the db
print(db_tables)
##  [1] "airports"     "carriers"     "planes"       "sqlite_stat1" "sqlite_stat4"
##  [6] "table_1987"   "table_1988"   "table_1989"   "table_1990"   "table_1991"  
## [11] "table_1992"   "table_1993"   "table_1994"   "table_1995"   "table_1996"  
## [16] "table_1997"   "table_1998"   "table_1999"   "table_2000"   "table_2001"  
## [21] "table_2002"   "table_2003"   "table_2004"   "table_2005"   "table_2006"  
## [26] "table_2007"   "table_2008"
db_1993_cols <- dbListFields(airline_db, "table_1993") # Column names for specific table in db
print(length(db_1993_cols))
## [1] 29

Structure of the dataset

We can see from the above result that there are 29 columns in the table and this is the same across all the tables (the yearly tables), they all have the same columns, but we don’t know exactly how many rows are in each table.

The code below is a script/query to return exactly the number of rows (observations) that are in each table. The query can take a few minutes to execute the first time.

count_rows <- function() {
    Table = character() # Empty vector/list to store table names
    Row_Count = integer() # Empty vector/list to store row counts

    for (table in db_tables) {
        query_rows <- sprintf("SELECT COUNT(*) AS Rows FROM %s", table)
        row_count <- dbGetQuery(airline_db, query_rows)[[1]]

        Table <- c(Table, table) # Appending each table name to the vector
        Row_Count <- c(Row_Count, row_count) # Appending each row count to the vector
    }

    df_row_count <- data.frame(Table, Row_Count)
    return(df_row_count)
}
table_row_count <- count_rows()
table_row_count
##           Table Row_Count
## 1      airports      3376
## 2      carriers      1491
## 3        planes      5029
## 4  sqlite_stat1        25
## 5  sqlite_stat4         0
## 6    table_1987   1311826
## 7    table_1988   5202096
## 8    table_1989   5041200
## 9    table_1990   5270893
## 10   table_1991   5076925
## 11   table_1992   5092157
## 12   table_1993   5070501
## 13   table_1994   5180048
## 14   table_1995   5327435
## 15   table_1996   5351983
## 16   table_1997   5411843
## 17   table_1998   5384721
## 18   table_1999   5527884
## 19   table_2000   5683047
## 20   table_2001   5967780
## 21   table_2002   5271359
## 22   table_2003   6488540
## 23   table_2004   7129270
## 24   table_2005   7140596
## 25   table_2006   7141922
## 26   table_2007   7453215
## 27   table_2008   2389217

We can now see the number of rows in each table, which sums up to over 120 million observations. To test the SQL connection, I load in the first 500 rows of data from a particular year (2005 dataset in this case) using SQL and the R interface

query_test <- "SELECT * FROM table_2003 LIMIT 10"

tbl(airline_db, sql(query_test)) # Runs the query and displays results without loading it in memory
## # Source:   SQL [10 x 29]
## # Database: sqlite 3.39.2 [O:\GitHub\data_analysis\alx\airline_performance_analysis\airline_data.sqlite]
##     Year Month DayofMo…¹ DayOf…² DepTime CRSDe…³ ArrTime CRSAr…⁴ Uniqu…⁵ Fligh…⁶
##    <dbl> <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>     <dbl>
##  1  2003     1        29       3    1651    1655    1912    1913 UA         1017
##  2  2003     1        30       4    1654    1655    1910    1913 UA         1017
##  3  2003     1        31       5    1724    1655    1936    1913 UA         1017
##  4  2003     1         1       3    1033    1035    1625    1634 UA         1018
##  5  2003     1         2       4    1053    1035    1726    1634 UA         1018
##  6  2003     1         3       5    1031    1035    1640    1634 UA         1018
##  7  2003     1         4       6    1031    1035    1626    1634 UA         1018
##  8  2003     1         5       7    1035    1035    1636    1634 UA         1018
##  9  2003     1         6       1    1031    1035    1653    1634 UA         1018
## 10  2003     1         1       3    1713    1710    1851    1847 UA         1020
## # … with 19 more variables: TailNum <chr>, ActualElapsedTime <dbl>,
## #   CRSElapsedTime <dbl>, AirTime <dbl>, ArrDelay <dbl>, DepDelay <dbl>,
## #   Origin <chr>, Dest <chr>, Distance <dbl>, TaxiIn <dbl>, TaxiOut <dbl>,
## #   Cancelled <dbl>, CancellationCode <int>, Diverted <dbl>,
## #   CarrierDelay <int>, WeatherDelay <int>, NASDelay <int>,
## #   SecurityDelay <int>, LateAircraftDelay <int>, and abbreviated variable
## #   names ¹​DayofMonth, ²​DayOfWeek, ³​CRSDepTime, ⁴​CRSArrTime, ⁵​UniqueCarrier, …
top_rows <- dbGetQuery(airline_db, query_test) # Runs the query and stores it in a dataframe, in memory♂

dbDisconnect(airline_db) # Disconnect from the database when done

Now lets look at some summary statistics for the data. I will read in the first 5,000 rows of the 2007 flight data.

top_2007 = pd.read_csv(main_files['2007'], nrows=5000)
top_2007.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 5000 entries, 0 to 4999
## Data columns (total 29 columns):
##  #   Column             Non-Null Count  Dtype  
## ---  ------             --------------  -----  
##  0   Year               5000 non-null   int64  
##  1   Month              5000 non-null   int64  
##  2   DayofMonth         5000 non-null   int64  
##  3   DayOfWeek          5000 non-null   int64  
##  4   DepTime            4964 non-null   float64
##  5   CRSDepTime         5000 non-null   int64  
##  6   ArrTime            4963 non-null   float64
##  7   CRSArrTime         5000 non-null   int64  
##  8   UniqueCarrier      5000 non-null   object 
##  9   FlightNum          5000 non-null   int64  
##  10  TailNum            5000 non-null   object 
##  11  ActualElapsedTime  4963 non-null   float64
##  12  CRSElapsedTime     5000 non-null   int64  
##  13  AirTime            4963 non-null   float64
##  14  ArrDelay           4963 non-null   float64
##  15  DepDelay           4964 non-null   float64
##  16  Origin             5000 non-null   object 
##  17  Dest               5000 non-null   object 
##  18  Distance           5000 non-null   int64  
##  19  TaxiIn             5000 non-null   int64  
##  20  TaxiOut            5000 non-null   int64  
##  21  Cancelled          5000 non-null   int64  
##  22  CancellationCode   36 non-null     object 
##  23  Diverted           5000 non-null   int64  
##  24  CarrierDelay       5000 non-null   int64  
##  25  WeatherDelay       5000 non-null   int64  
##  26  NASDelay           5000 non-null   int64  
##  27  SecurityDelay      5000 non-null   int64  
##  28  LateAircraftDelay  5000 non-null   int64  
## dtypes: float64(6), int64(18), object(5)
## memory usage: 1.1+ MB

Most of the columns are numeric, some indicating arrival and departure, as well as different causes of delays. There are some binary columns such as “Cancelled” and “Diverted” which are important variables to analyze.

Features of interest

For this EDA, I am interested in exploring some of the ideas suggested on the source website which are:

  1. When is the best time of day/day of week/time of year to fly to minimize delays?
  2. Do older planes suffer more delays?
  3. How well does weather predict plane delays?

Generally, I am interested in exploring the cause of flight delays and cancellations.

Areas To Focus On

For this investigation, I will get the best insights by focusing on the “delay” columns. By analyzing the delays on each day of the week and each month, I believe I can get a good idea of the best times to fly. I will explore data for a single year. Then later on, I will compare the data across the other years to see if there are similar patterns across the years.

Univariate Exploration

I will start by analyzing the reason for flight cancellations: CancellationCode column. I will be using the 2007 dataset. Because the data is so large, I will only read in some select columns into the dataframe.

Reasons For Cancelled Flights

get_cancelled_flights <- function(year){
    airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
    s <- sprintf("CASE 
                    WHEN DayOfWeek = 1 Then 'Monday'
                    WHEN DayOfWeek = 2 Then 'Tuesday'
                    WHEN DayOfWeek = 3 Then 'Wednesday'
                    When DayOfWeek = 4 Then 'Thursday'
                    When DayOfWeek = 5 Then 'Friday'
                    When DayOfWeek = 6 Then 'Saturday'
                    When DayOfWeek = 7 Then 'Sunday'
                  END AS Day")
    query_cc <- sprintf("SELECT Month, DayofMonth, DayOfWeek, %s, CancellationCode, COUNT(*) AS Flights
                        FROM table_%s
                        WHERE Cancelled = 1
                        GROUP BY Month, DayofMonth, DayOfWeek, Day, CancellationCode", s, year)
    cancelled_flights <- dbGetQuery(airline_db, query_cc) # Runs the query and stores it in a dataframe, in memory♂
    dbDisconnect(airline_db) # Disconnect from the database when done
    return(cancelled_flights)
}

cancelled_flights_2007 = get_cancelled_flights('2007')
head(cancelled_flights_2007)
##   Month DayofMonth DayOfWeek     Day CancellationCode Flights
## 1     1          1         1  Monday                A     268
## 2     1          1         1  Monday                B      56
## 3     1          1         1  Monday                C      35
## 4     1          2         2 Tuesday                A     167
## 5     1          2         2 Tuesday                B      21
## 6     1          2         2 Tuesday                C      23
# Converting R dataframe to Pandas dataframe
cancelled_flights_2007 = pd.DataFrame(r.cancelled_flights_2007)
cancelled_flights_2007.info()

# Changing numeric data types to integer
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1095 entries, 0 to 1094
## Data columns (total 6 columns):
##  #   Column            Non-Null Count  Dtype  
## ---  ------            --------------  -----  
##  0   Month             1095 non-null   float64
##  1   DayofMonth        1095 non-null   float64
##  2   DayOfWeek         1095 non-null   float64
##  3   Day               1095 non-null   object 
##  4   CancellationCode  1095 non-null   object 
##  5   Flights           1095 non-null   int32  
## dtypes: float64(3), int32(1), object(2)
## memory usage: 47.2+ KB
cancelled_flights_2007 = cancelled_flights_2007.astype({'Month':'int8', 'DayofMonth':'int8', 'DayOfWeek':'int8'})
cancelled_flights_2007.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1095 entries, 0 to 1094
## Data columns (total 6 columns):
##  #   Column            Non-Null Count  Dtype 
## ---  ------            --------------  ----- 
##  0   Month             1095 non-null   int8  
##  1   DayofMonth        1095 non-null   int8  
##  2   DayOfWeek         1095 non-null   int8  
##  3   Day               1095 non-null   object
##  4   CancellationCode  1095 non-null   object
##  5   Flights           1095 non-null   int32 
## dtypes: int32(1), int8(3), object(2)
## memory usage: 24.7+ KB
labels = ['Carrier', 'Weather', 'NAS', 'Security']
sns.set_theme(style="darkgrid", palette=None, font_scale=1.5)
base_color = sns.color_palette("husl", 9)

def pie_cancel():
    plt.figure(figsize=[16, 10])
    cg = cancelled_flights_2007.groupby("CancellationCode").sum()["Flights"]
    plt.pie(x=cg, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
    plt.axis("square")
    plt.show();

pie_cancel()

We can see from the chart that most of the cancelled flights in 2007 were as a result of carrier delays, followed closely by weather delays. Lets see if its the same trend across other years. I will be working with a 4 year period (2005 to 2008).

cancelled_flights_2005 = get_cancelled_flights('2005')
cancelled_flights_2006 = get_cancelled_flights('2006')
cancelled_flights_2007 = get_cancelled_flights('2007')
cancelled_flights_2008 = get_cancelled_flights('2008')
cancelled_flights_2005 = pd.DataFrame(r.cancelled_flights_2005)
cancelled_flights_2006 = pd.DataFrame(r.cancelled_flights_2006)
cancelled_flights_2007 = pd.DataFrame(r.cancelled_flights_2007)
cancelled_flights_2008 = pd.DataFrame(r.cancelled_flights_2008)
cancelled_list = [cancelled_flights_2005, cancelled_flights_2006, cancelled_flights_2007, cancelled_flights_2008]
cancelled_list_str = ['cancelled_flights_2005', 'cancelled_flights_2006', 'cancelled_flights_2007', 'cancelled_flights_2008']

def save_to_csv(df_list):
    for idx, df in enumerate(df_list):
        os.makedirs('airline/saved_data/', exist_ok=True)
        df.to_csv(f'airline/saved_data/{cancelled_list_str[idx]}.csv', index=False)

# save_to_csv(cancelled_list)
def pie_multi():
    plt.figure(figsize=[16, 10])
    
    plt.subplot(2, 2, 1)
    cg3 = cancelled_flights_2005.groupby("CancellationCode").sum()["Flights"]
    plt.pie(x=cg3, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
    
    plt.subplot(2, 2, 2)
    cg4 = cancelled_flights_2006.groupby("CancellationCode").sum()["Flights"]
    plt.pie(x=cg4, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
    
    plt.subplot(2, 2, 3)
    cg5 = cancelled_flights_2007.groupby("CancellationCode").sum()["Flights"]
    plt.pie(x=cg5, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
    
    plt.subplot(2, 2, 4)
    cg6 = cancelled_flights_2008.groupby("CancellationCode").sum()["Flights"]
    plt.pie(x=cg6, labels=labels, startangle=90, counterclock=False, autopct='%.0f%%');
    
    plt.show()

pie_multi()

We can see that carrier is consistently a major reason for cancelled flights followed by weather.

Departure Delay Times

Here I try to see the delay times to see the most common delay duration.

df_2007 = pd.read_csv(main_files['2007'], usecols = ['Month', 'DayofMonth', 'ArrDelay', 'DepDelay', 'Distance'], nrows = 2_000_000)
df_2007.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2000000 entries, 0 to 1999999
## Data columns (total 5 columns):
##  #   Column      Dtype  
## ---  ------      -----  
##  0   Month       int64  
##  1   DayofMonth  int64  
##  2   ArrDelay    float64
##  3   DepDelay    float64
##  4   Distance    int64  
## dtypes: float64(2), int64(3)
## memory usage: 76.3 MB
df_2007.describe()

# Check for missing values
##               Month    DayofMonth      ArrDelay      DepDelay      Distance
## count  2.000000e+06  2.000000e+06  1.935147e+06  1.939160e+06  2.000000e+06
## mean   2.182453e+00  1.555414e+01  1.050200e+01  1.184517e+01  7.000558e+02
## std    9.717774e-01  8.695236e+00  3.906322e+01  3.582602e+01  5.458290e+02
## min    1.000000e+00  1.000000e+00 -3.120000e+02 -3.050000e+02  1.100000e+01
## 25%    1.000000e+00  8.000000e+00 -9.000000e+00 -4.000000e+00  3.080000e+02
## 50%    2.000000e+00  1.600000e+01  0.000000e+00  0.000000e+00  5.460000e+02
## 75%    3.000000e+00  2.300000e+01  1.500000e+01  1.200000e+01  9.300000e+02
## max    4.000000e+00  3.100000e+01  1.564000e+03  1.547000e+03  4.962000e+03
df_2007['DepDelay'].isnull().sum()
## 60840
df_2007[df_2007.ArrDelay.notnull()]
##          Month  DayofMonth  ArrDelay  DepDelay  Distance
## 0            1           1       1.0       7.0       389
## 1            1           1       8.0      13.0       479
## 2            1           1      34.0      36.0       479
## 3            1           1      26.0      30.0       479
## 4            1           1      -3.0       1.0       479
## ...        ...         ...       ...       ...       ...
## 1999995      4          30      22.0       0.0       741
## 1999996      4          30      48.0      50.0      1024
## 1999997      4          30     -12.0      -1.0       646
## 1999998      4          30      12.0      20.0       589
## 1999999      4          30     -21.0      -5.0       461
## 
## [1935147 rows x 5 columns]
# Changing column data types to reduce memory usage
df_2007 = df_2007.astype({'Month':'int8', 'DayofMonth':'int8', 'ArrDelay':'float32', 'DepDelay':'float32'})
def delay_hist():
    binsize = 30
    bins = np.arange(0, df_2007['DepDelay'].max()+binsize, binsize)
    
    plt.figure(figsize=[16, 10]);
    plt.hist(data = df_2007, x = 'DepDelay', bins = bins);
    plt.xlabel('Departure Delays (mins)');
    plt.show()
    
delay_hist()

The distribution is skewed to the left and there is a short tail. A large majority of the data falls within the range of 0 and 250 minutes. I would have gone for a logarithmic scale but this data has negative values (because there are flights that took off before the expected departure time). There are also some missing values.

# Flights that took off over 25 minutes earlier
df_2007[df_2007.DepDelay < -25].info()

# Lowest departure delay
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 165 entries, 99831 to 1977687
## Data columns (total 5 columns):
##  #   Column      Non-Null Count  Dtype  
## ---  ------      --------------  -----  
##  0   Month       165 non-null    int8   
##  1   DayofMonth  165 non-null    int8   
##  2   ArrDelay    163 non-null    float32
##  3   DepDelay    165 non-null    float32
##  4   Distance    165 non-null    int64  
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 4.2 KB
df_2007[df_2007.DepDelay < 0].DepDelay.min()
## -305.0

We can see that there are many flights that took off before the expected departure time (almost 50% of all the flights in that year). That is not unusual, especially if it falls within a few minutes and all passengers are available, but there are many flights that took off unusually early (over 30 minutes early, even up to 5 hours early). There can be many reasons for this but for now since this exploration is mainly focused on delay times and there are so many records to work with, I will only assess flights that were actually delayed,

df_2007_delayed = df_2007[df_2007.DepDelay > 0]
df_2007_delayed.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 867613 entries, 0 to 1999998
## Data columns (total 5 columns):
##  #   Column      Non-Null Count   Dtype  
## ---  ------      --------------   -----  
##  0   Month       867613 non-null  int8   
##  1   DayofMonth  867613 non-null  int8   
##  2   ArrDelay    865260 non-null  float32
##  3   DepDelay    867613 non-null  float32
##  4   Distance    867613 non-null  int64  
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 21.5 MB
df_2007_delayed.describe()
##                Month     DayofMonth  ...       DepDelay       Distance
## count  867613.000000  867613.000000  ...  867613.000000  867613.000000
## mean        2.199465      15.268106  ...      31.058889     732.023085
## std         0.961845       8.696432  ...      46.776241     554.655358
## min         1.000000       1.000000  ...       1.000000      11.000000
## 25%         1.000000       8.000000  ...       5.000000     328.000000
## 50%         2.000000      15.000000  ...      15.000000     587.000000
## 75%         3.000000      22.000000  ...      38.000000     967.000000
## max         4.000000      31.000000  ...    1547.000000    4962.000000
## 
## [8 rows x 5 columns]

Now I will try to plot using a log scale

def log_hist():
    log_binsize = 0.10
    bins = 10 ** np.arange(0, np.log10(df_2007_delayed['DepDelay'].max())+log_binsize, log_binsize)
    
    plt.figure(figsize=[16, 10])
    plt.hist(data = df_2007_delayed, x = 'DepDelay', bins = bins)
    plt.xscale('log')
    # plt.xticks([500, 1e3, 2e3, 5e3, 1e4, 2e4], [50, '100', '200', '400', '800', '1600'])
    plt.xlabel('Departure Delays (mins)')
    plt.show()
    
log_hist()

Bivariate and Multivarite Exploration

I will look at the departure delays and arrival delays. Specifically, I will focus on the arrival delay and its relationship with distance. I want to see if longer flights tend to have longer delays.

df_2007_sampled = df_2007.dropna(subset=['ArrDelay', 'Distance']).sample(n=20000, replace = False)
def dist_vs_arr():
    # Dropping rows with missing values
    print("Original rows and columns =",df_2007.shape)
    print("Sampled rows and columns =",df_2007_sampled.shape)
    
    print(df_2007_sampled.info())
    print(df_2007_sampled.head())
    
    plt.figure(figsize=[12, 12])
    plt.scatter(data=df_2007_sampled, x='Distance', y='ArrDelay')
    plt.xlabel("Distance in miles")
    plt.ylabel("Arrival Delay in minutes")
    plt.show()
    
dist_vs_arr()
## Original rows and columns = (2000000, 5)
## Sampled rows and columns = (20000, 5)
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 20000 entries, 1917893 to 1931736
## Data columns (total 5 columns):
##  #   Column      Non-Null Count  Dtype  
## ---  ------      --------------  -----  
##  0   Month       20000 non-null  int8   
##  1   DayofMonth  20000 non-null  int8   
##  2   ArrDelay    20000 non-null  float32
##  3   DepDelay    20000 non-null  float32
##  4   Distance    20000 non-null  int64  
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 507.8 KB
## None
##          Month  DayofMonth  ArrDelay  DepDelay  Distance
## 1917893      4          30       3.0      -2.0       236
## 769614       2           8      13.0      15.0       793
## 985675       2           3     109.0      89.0      1266
## 1699625      3          27      17.0      11.0      1345
## 863445       2          17     206.0     229.0      1515

There seems to be very little correlation from this plot. Since there are a lot of overlapping points, I will apply some transparency to get a better picture.

def dist_vs_arr_blur():
    plt.figure(figsize=[12, 12])
    sns.regplot(data=df_2007_sampled, x='Distance', y='ArrDelay', scatter_kws = {'alpha': 1/5}, fit_reg = False);
    plt.xlabel("Distance in miles")
    plt.ylabel("Arrival Delay in minutes")
    plt.show()

dist_vs_arr_blur()

Most of the flights had distances between the range of 2500 miles and delays within 400 minutes. Again I will be removing the negative values since I am only interested in flights that arrived later than expected.

df_2007_sampled = df_2007_sampled.query("0 < ArrDelay < 200")

def dist_vs_arr_focus():
    plt.figure(figsize=[12, 12])
    sns.regplot(data=df_2007_sampled, x='Distance', y='ArrDelay', scatter_kws = {'alpha': 1/5}, fit_reg = False);
    plt.show();

dist_vs_arr_focus()

Many of the flights with shorter distances seemed to have shorter delays since the density of the dots seem to fade as it gets away from the 0 mark. Lets try to get a different picture with another plot.

def heat_plot():
    x_bins = np.arange(0, 4500+100, 100)
    y_bins = np.arange(0, 200+7, 7)
    
    plt.figure(figsize=[12, 12])
    plt.hist2d(data=df_2007_sampled, x='Distance', y='ArrDelay', cmin=0.3, cmap='viridis_r', bins = [x_bins, y_bins]);
    plt.show();

heat_plot()
## <string>:6: MatplotlibDeprecationWarning: Auto-removal of grids by pcolor() and pcolormesh() is deprecated since 3.5 and will be removed two minor releases later; please call grid(False) first.

Its still the same story from what I can see here.

Day of Week Analysis

Next, I will analyze the delays on a day-of-week basis.

get_mean_delay_year <- function(year) {
    airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
    
    # Query for mean delay times in January (actual delays, no early flights)
    query_jan <- sprintf("SELECT DayOfWeek, DayOfMonth, 
                                 AVG(DepDelay) AS MeanDepDelay, 
                                 AVG(ArrDelay) AS MeanArrDelay
                          FROM table_%s
                          WHERE DepDelay > 0 AND ArrDelay > 0
                          GROUP BY DayOfWeek, DayOfMonth", year)
    
    # tbl(airline_db, sql(query_jan)) # Runs the query and displays results without loading it in memory
    delays_2007 <- dbGetQuery(airline_db, query_jan) # Runs the query and stores it in a dataframe, in memory♂
    dbDisconnect(airline_db) # Disconnect from the database when done
    return(delays_2007)
}
delays_2007 <- get_mean_delay_year('2007')
head(delays_2007)
##   DayOfWeek DayofMonth MeanDepDelay MeanArrDelay
## 1         1          1     35.50271     35.54895
## 2         1          2     40.18925     41.42904
## 3         1          3     38.94795     39.51203
## 4         1          4     46.83710     49.46875
## 5         1          5     36.06113     37.46486
## 6         1          6     42.34605     44.32375
skim(delays_2007)
Data summary
Name delays_2007
Number of rows 216
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DayOfWeek 0 1 3.99 2.00 1.00 2.00 4.00 6.00 7.00 ▇▃▃▃▇
DayofMonth 0 1 15.93 8.93 1.00 8.00 16.00 24.00 31.00 ▇▇▇▇▇
MeanDepDelay 0 1 38.37 6.48 24.67 33.25 37.68 41.89 61.84 ▃▇▆▁▁
MeanArrDelay 0 1 40.66 7.88 24.93 34.60 39.75 45.29 70.32 ▅▇▅▁▁

For the DayOfWeek data, I will make another column in the dataframe that shows the text representation (Monday, Tuesday …) so that it would be easier to understand in the plot.

def change_column_type(df):
    # Converting R dataframe to Pandas dataframe
    df_delays = pd.DataFrame(df)
    
    # Changing day and month columns from float to integer data types
    df_delays = df_delays.astype({'DayOfWeek':'int8', 'DayofMonth':'int8'})
    
    days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    
    # Creating the new column
    df_delays['Day'] = df_delays['DayOfWeek'].apply(lambda x: days_of_week[x-1])
    
    return(df_delays)

delays_2007 = change_column_type(r.delays_2007)
delays_2007.head()
##    DayOfWeek  DayofMonth  MeanDepDelay  MeanArrDelay     Day
## 0          1           1     35.502709     35.548949  Monday
## 1          1           2     40.189248     41.429041  Monday
## 2          1           3     38.947953     39.512035  Monday
## 3          1           4     46.837098     49.468746  Monday
## 4          1           5     36.061127     37.464863  Monday
def plot_bar_err_single():
    plt.figure(figsize=[16,12])
    sns.barplot(data=delays_2007, x='Day', y='MeanDepDelay')
    plt.title("Average Flight Delay Times In 2007")
    plt.xlabel("Day of Week")
    plt.ylabel("Average Departure Delay (min)")
    plt.show();

plot_bar_err_single()

From the above chart alone, the average delay times for each weekday seems to be fairly the same. I will see if there is a consistent pattern across the years by plotting the chart for 9 consecutive years (2000 - 2008)

I left the colors because, even though the weekday variable is ordinal (i.e. Tuesday comes after Monday and so on), the order doesn’t really matter much in this case because, for example, Friday is not better than Sunday, Monday is not higher than Saturday, etc. The colors will be helpful in identifying each weekday in the subsequent plots.

delays_2000 <- get_mean_delay_year('2000')
delays_2001 <- get_mean_delay_year('2001')
delays_2002 <- get_mean_delay_year('2002')
delays_2003 <- get_mean_delay_year('2003')
delays_2004 <- get_mean_delay_year('2004')
delays_2005 <- get_mean_delay_year('2005')
delays_2006 <- get_mean_delay_year('2006')
delays_2007 <- get_mean_delay_year('2007')
delays_2008 <- get_mean_delay_year('2008')
delays_2000 = change_column_type(r.delays_2000)
delays_2001 = change_column_type(r.delays_2001)
delays_2002 = change_column_type(r.delays_2002)
delays_2003 = change_column_type(r.delays_2003)
delays_2004 = change_column_type(r.delays_2004)
delays_2005 = change_column_type(r.delays_2005)
delays_2006 = change_column_type(r.delays_2006)
delays_2007 = change_column_type(r.delays_2007)
delays_2008 = change_column_type(r.delays_2008)
delays_list = [delays_2000, delays_2001, delays_2002, delays_2003, delays_2004, delays_2005, delays_2006, delays_2007, delays_2008]
delays_list_str = ['delays_2000', 'delays_2001', 'delays_2002', 'delays_2003', 'delays_2004', 'delays_2005', 'delays_2006', 'delays_2007', 'delays_2008']

def save_to_csv(df_list):
    for idx, df in enumerate(df_list):
        os.makedirs('airline/saved_data/', exist_ok=True)
        df.to_csv(f'airline/saved_data/{delays_list_str[idx]}.csv', index=False)

# save_to_csv(delays_list)
def bar_plot_err_multi():
    fig, ax = plt.subplots(ncols = 3, nrows = 3 , figsize = [22,17])
    
    sns.barplot(data=delays_2000, x='Day', y='MeanDepDelay', ax = ax[0, 0])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    sns.barplot(data=delays_2001, x='Day', y='MeanDepDelay', ax = ax[1, 0])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    sns.barplot(data=delays_2002, x='Day', y='MeanDepDelay', ax = ax[2, 0])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    sns.barplot(data=delays_2003, x='Day', y='MeanDepDelay', ax = ax[0, 1])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    sns.barplot(data=delays_2004, x='Day', y='MeanDepDelay', ax = ax[1, 1])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    sns.barplot(data=delays_2005, x='Day', y='MeanDepDelay', ax = ax[2, 1])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    sns.barplot(data=delays_2006, x='Day', y='MeanDepDelay', ax = ax[0, 2])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    sns.barplot(data=delays_2007, x='Day', y='MeanDepDelay', ax = ax[1, 2])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    sns.barplot(data=delays_2008, x='Day', y='MeanDepDelay', ax = ax[2, 2])
    # plt.title(f"Average Flight Delay Times In {delay_tables}")
    
    plt.show();

bar_plot_err_multi()    

From an overview of the chart above, there is no consistent trend to predict which weekdays have more delays. Though if we look closely, it looks like Thursday generally has more delays but its not very obvious. Also, the minimum and maximum delay times are all within a small range of 30 to 40 minutes so there is very little variation. This is expected because there are many other factors to consider like the month, the season, holidays, airport carrier, plane age and global events.

Cancelled and Diverted

Now I will examine the cancelled and diverted flights, relative to the carriers (airline companies). I want to see if flights from a carrier tend to get get cancelled or diverted more than others.

Below is an SQL query to get all flights that were either diverted or cancelled and group them by the flight carrier. On the original table, there is a column for diverted (1 or 0) and another column for cancelled (1 or 0). I believe the data is not completely tidy because a flight that is cancelled cannot be diverted and vice-versa. So I combined them to a single column indicating whether the flight was cancelled or diverted.

get_changed_flights <- function(year) {
    airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
    query <- sprintf("SELECT 
                        UniqueCarrier,
                        Description AS Carrier,
                        CASE 
                            WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted'
                            WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled'
                        END AS FlightStatus,
                        COUNT(*) AS Flights
                      FROM table_%s
                      LEFT JOIN carriers
                      ON table_%s.UniqueCarrier = carriers.Code
                      WHERE Diverted = 1 OR Cancelled = 1
                      GROUP BY UniqueCarrier, Carrier, FlightStatus
                      ORDER BY UniqueCarrier", year, year)
    
    flights_changed <- dbGetQuery(airline_db, query) # Runs the query and stores it in a dataframe, in memory♂
    dbDisconnect(airline_db) # Disconnect from the database when done
    return(flights_changed)
}

flights_changed_2007 <- get_changed_flights('2007')
head(flights_changed_2007, 10)
##    UniqueCarrier                Carrier FlightStatus Flights
## 1             9E Pinnacle Airlines Inc.    Cancelled    7939
## 2             9E Pinnacle Airlines Inc.     Diverted     726
## 3             AA American Airlines Inc.    Cancelled   17924
## 4             AA American Airlines Inc.     Diverted    2097
## 5             AQ    Aloha Airlines Inc.    Cancelled     388
## 6             AQ    Aloha Airlines Inc.     Diverted      15
## 7             AS   Alaska Airlines Inc.    Cancelled    2563
## 8             AS   Alaska Airlines Inc.     Diverted     499
## 9             B6        JetBlue Airways    Cancelled    3710
## 10            B6        JetBlue Airways     Diverted     624

Lets see the number of cancelled and diverted flights in this data (the 2007 data).

flights_changed_2007 = pd.DataFrame(r.flights_changed_2007)

print(f"Total number of diverted or cancelled flights in 2007: {flights_changed_2007.Flights.sum()}")
## Total number of diverted or cancelled flights in 2007: 177927
flights_changed_2007.head()
##   UniqueCarrier                 Carrier FlightStatus  Flights
## 0            9E  Pinnacle Airlines Inc.    Cancelled     7939
## 1            9E  Pinnacle Airlines Inc.     Diverted      726
## 2            AA  American Airlines Inc.    Cancelled    17924
## 3            AA  American Airlines Inc.     Diverted     2097
## 4            AQ     Aloha Airlines Inc.    Cancelled      388
def plot_vertical_clustered():
    plt.figure(figsize=[16,12])
    sns.barplot(data = flights_changed_2007, x = 'UniqueCarrier', y = 'Flights', hue = 'FlightStatus')
    plt.title("Cancelled & Diverted Flights From Each Carrier In 2007")
    plt.show()
    
plot_vertical_clustered()

We can already see that MQ had the highest number of cancelled flights (by a relatively wide margin). To better understand the plot, I will make it horizontal and order it by number of cancelled flights.

flights_changed_2007 = flights_changed_2007.sort_values(by='Flights', ascending=False)

def plot_horizontal_clustered():
    #Sort by descending order of number of flights
    
    plt.figure(figsize=[20,14])
    
    sns.barplot(data = flights_changed_2007, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
    plt.title("Cancelled & Diverted Flights From Each Carrier In 2007")
    plt.show()

plot_horizontal_clustered()

Its easier to see that the airline companies with the most number of cancelled flight is American Eagle Airlines Inc followed by American Ailines and for diverted flights we have American Airlines and Southwest Airlines Co.. So we can at least judge that the carriers at the bottom of the chart have a good record of flight data (though I am aware that the chart could be like that because those companies at the bottom do not have as many flights as those at the top).

Now I’m going to plot the chart for the most recent 6 years from the dataset, to see if this trend is the same across the years, i.e, to see if the same companies are always on top.

flights_changed_2003 = get_changed_flights('2003') 
flights_changed_2004 = get_changed_flights('2004') 
flights_changed_2005 = get_changed_flights('2005') 
flights_changed_2006 = get_changed_flights('2006') 
flights_changed_2007 = get_changed_flights('2007') 
flights_changed_2008 = get_changed_flights('2008') 
flights_changed_2003 =pd.DataFrame(r.flights_changed_2003).sort_values(by='Flights', ascending=False)
flights_changed_2004 =pd.DataFrame(r.flights_changed_2004).sort_values(by='Flights', ascending=False)
flights_changed_2005 =pd.DataFrame(r.flights_changed_2005).sort_values(by='Flights', ascending=False)
flights_changed_2006 =pd.DataFrame(r.flights_changed_2006).sort_values(by='Flights', ascending=False)
flights_changed_2007 =pd.DataFrame(r.flights_changed_2007).sort_values(by='Flights', ascending=False)
flights_changed_2008 =pd.DataFrame(r.flights_changed_2008).sort_values(by='Flights', ascending=False)
flight_changed = [flights_changed_2003, flights_changed_2004, flights_changed_2005, flights_changed_2006, flights_changed_2007, flights_changed_2008]
flight_changed_str = ['flights_changed_2003', 'flights_changed_2004', 'flights_changed_2005', 'flights_changed_2006', 'flights_changed_2007', 'flights_changed_2008']

def save_to_csv(df_list):
    for idx, df in enumerate(df_list):
        os.makedirs('airline/saved_data/', exist_ok=True)
        df.to_csv(f'airline/saved_data/{flight_changed_str[idx]}.csv', index=False)

# save_to_csv(flight_changed)
import PyQt5

def carrier_multi():
    plt.figure(figsize=(22,22))
    base_color = sns.color_palette()[2]
    
    plt.suptitle("2003 to 2006")
    
    plt.subplot(3, 2, 1)
    sns.barplot(data = flights_changed_2003, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
    plt.title("2003")
    plt.subplot(3, 2, 2)
    sns.barplot(data = flights_changed_2004, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
    plt.title("2004")
    plt.subplot(3, 2, 3)
    sns.barplot(data = flights_changed_2005, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
    plt.title("2005")
    plt.subplot(3, 2, 4)
    sns.barplot(data = flights_changed_2006, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
    plt.title("2006")
    plt.subplot(3, 2, 5)
    sns.barplot(data = flights_changed_2007, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
    plt.title("2007")
    plt.subplot(3, 2, 6)
    sns.barplot(data = flights_changed_2008, y = 'UniqueCarrier', x = 'Flights', hue = 'FlightStatus')
    plt.title("2008")
    
    # Hiding the axis ticks and tick labels of the bigger plot
    plt.tick_params(labelcolor="none", bottom=False, left=False)
    
    # Adding the x-axis and y-axis labels for the bigger plot
    plt.xlabel('Common X-Axis', fontsize=15, fontweight='bold')
    plt.ylabel('Common Y-Axis', fontsize=15, fontweight='bold')
    
    plt.show()

carrier_multi()

For the above chart, I changed the y-labels back to the short form so that it can fit. We can see that there are companies that consistently appear among the top 5 for cancelled flights, American Eagle Airlines and American Airlines for example.

Next I want to analyze the number of delays from a general perspective. With the level of advancements in technology, I expect to see a relative reduction in the percentage of delayed flights each year.

The following query takes over 20 minutes to run so I saved the data in a CSV file after running it, so that I don’t have to rerun it unless I change something.

get_all_flights <- function() {
    airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db

    s <- "Year, 
          CASE 
              WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted'
              WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled'
              ELSE 'Arrived'
          END AS FlightStatus,
          COUNT(*) AS Flights"
    
    query <- sprintf("
        SELECT %s FROM table_1987 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1988 GROUP BY Year, FlightStatus
        UNION ALL 
        SELECT %s FROM table_1989 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1990 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1991 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1992 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1993 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1994 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1995 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1996 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1997 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1998 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1999 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2000 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_2001 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2002 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_2003 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2004 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_2005 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2006 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_2007 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2008 GROUP BY Year, FlightStatus", 
        s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s)
    
    all_flights <- dbGetQuery(airline_db, query) # Runs the query and stores it in a dataframe, in memory♂
    dbDisconnect(airline_db) # Disconnect from the database when done
    return(all_flights)
}
# all_flights <- get_all_flights()
# write.csv(all_flights,"airline/saved_data/all_flights.csv", row.names = FALSE)
all_flights <- read.csv('airline/saved_data/all_flights.csv')
head(all_flights)
##   Year FlightStatus Flights
## 1 1987      Arrived 1288326
## 2 1987    Cancelled   19685
## 3 1987     Diverted    3815
## 4 1988      Arrived 5137497
## 5 1988    Cancelled   50163
## 6 1988     Diverted   14436

def not_arrived():
    not_arrived = pd.DataFrame(r.all_flights).query("FlightStatus != 'Arrived'")
    # Only flights that did not arrive
    
    bins = np.arange(1987, 2008+1)
    plt.figure(figsize=[18,12])
    base_color = sns.color_palette()[2]
    
    sns.lineplot(data=not_arrived, x='Year', y='Flights', hue='FlightStatus')
    plt.xticks(bins)
    plt.show();

not_arrived()

From the above, we can see that there a relatively rapid increase in cancelled flights across the years getting to a peak in 2001 followed by a drastic drop the following year and the the increase continues. One major world event around the time of that peak is the dot-com bubble which could have had easily had a major effect on flights around 2000 to 2002. The change in the diverted flights is less drastic.

Summary

Some of the interesting discoveries I made include the fact that there was a very large number of flights that were not delayed but instead left earlier than scheduled. Even if this were to happen, I wouldn’t expect it to be more than a couple of minutes since anyone that is boarding a flight is expected to be at the airport a few minutes before departure time. But I found that there are flights that departed as early as 2 hours and some even up to 5 hours before scheduled.